Statistical modeling of sequencing errors in SAGE libraries.
Abstract
MOTIVATION: Sequencing errors may bias the gene expression measurements made by Serial Analysis of Gene Expression (SAGE). They may introduce non-existent tags at low abundance and decrease the observed abundance of real tags. These effects are stronger in the longer tags generated in LongSAGE libraries. Current sequencing technology generates quite accurate estimates of sequencing error rates. Here we make use of the sequence neighborhood of SAGE tags and error estimates from the base-calling software to correct for such errors.

RESULTS: We introduce a statistical model for the propagation of sequencing errors in SAGE and suggest an Expectation-Maximization (EM) algorithm to correct for them, given the observed sequences in a library and base-calling error estimates. We tested our method using simulated and experimental SAGE libraries. When comparing SAGE libraries, we found that sequencing errors can introduce considerable bias. High-abundance tags may be falsely called significantly differentially expressed, especially when comparing libraries with different levels of sequencing errors and/or of different sizes. Truly differentially expressed tags have decreased significance, as 'true'-tag counts are generally underestimated. This may change whether tags near the threshold of differential expression are called significant. Moreover, the number of different transcripts present in a library is overestimated, as false tags are introduced at low abundance. Our correction method adjusts the tag counts to be closer to the true counts and is able to partly correct for biases introduced by sequencing errors.

AVAILABILITY: An implementation using R is distributed as an R package. An online version is available at http://tagcalling.mbgproject.org
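As a rough illustration of the idea in the abstract, the sketch below runs a generic EM update that reassigns observed tag counts to likely source tags. It is not the authors' model or their R package. It assumes a single uniform per-base substitution rate (`err`, a hypothetical parameter) instead of per-base error estimates from base-calling software, and it only considers tags seen in the library as candidate true tags. The function names `substitution_likelihood` and `em_correct` are made up for this example.

```python
def substitution_likelihood(true_tag, obs_tag, err):
    """P(obs_tag | true_tag), assuming independent per-base substitutions.

    Assumption: every base is miscalled with the same probability `err`,
    and a miscall is equally likely to be any of the three other bases.
    """
    p = 1.0
    for t, o in zip(true_tag, obs_tag):
        p *= (1.0 - err) if t == o else err / 3.0
    return p


def em_correct(counts, err=0.01, n_iter=50):
    """Estimate corrected tag counts with a simple mixture-model EM.

    counts: dict mapping observed tag sequence -> observed count.
    Returns a dict mapping tag -> estimated true count (floats).
    """
    tags = list(counts)
    total = sum(counts.values())
    # Start from the observed tag proportions.
    pi = {t: counts[t] / total for t in tags}
    # Likelihood of observing tag o when the true tag is s.
    lik = {(s, o): substitution_likelihood(s, o, err)
           for s in tags for o in tags}

    for _ in range(n_iter):
        expected = dict.fromkeys(tags, 0.0)
        for o in tags:
            # E-step: posterior over source tags for observed tag o.
            weights = {s: pi[s] * lik[(s, o)] for s in tags}
            norm = sum(weights.values())
            for s in tags:
                expected[s] += counts[o] * weights[s] / norm
        # M-step: new proportions are the expected source counts.
        pi = {s: expected[s] / total for s in tags}

    return {s: pi[s] * total for s in tags}


# Toy library: the second tag differs from the first by one base and is
# rare enough to be explained by sequencing errors of the first.
observed = {"ACGTACGTAC": 1000, "ACGTACGTAA": 3, "TTTTGGGGCC": 200}
corrected = em_correct(observed)
```

In this toy library the rare one-mismatch neighbor loses most of its count to the abundant tag, while the unrelated tag is left essentially unchanged, which mirrors the qualitative behavior the abstract describes: false low-abundance tags shrink and true-tag counts rise.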
Related articles
Correction of sequence-based artifacts in serial analysis of gene expression
MOTIVATION: Serial Analysis of Gene Expression (SAGE) is a powerful technology for measuring global gene expression, through rapid generation of large numbers of transcript tags. Beyond their intrinsic value in differential gene expression analysis, SAGE tag collections afford abundant information on the size and shape of the sample transcriptome and can accelerate novel gene discovery. These la...
Serial Analysis of Gene Expression (SAGE) - Sequencing Errors
Serial Analysis of Gene Expression (SAGE) is a technique to study overall gene expression in different (normal or diseased) tissues. Results take the form of a so-called SAGE library for each of the tissues studied. A SAGE library is a set of text strings (typically 10 base pairs long), called tags. A tag is representative of a gene that is active in a particular cell or tissue. From a statistical...
Identification and prevention of a GC content bias in SAGE libraries.
Serial Analysis of Gene Expression (SAGE) is becoming a widely used gene expression profiling method for the study of development, cancer and other human diseases. Investigators using SAGE rely heavily on the quantitative aspect of this method for cataloging gene expression and comparing multiple SAGE libraries. We have developed additional computational and statistical tools to assess the qual...
Can transcriptome size be estimated from SAGE catalogs?
MOTIVATION: SAGE (Serial Analysis of Gene Expression) can be used to estimate the number of unique transcripts in a transcriptome. A simple estimator that corrects for sequencing and sampling errors was applied to a SAGE library (137 832 tags) obtained from mouse embryonic stem cells, and also to Monte Carlo simulated libraries generated using assumed distributions of 'true' expression levels co...
Tissue-specific transcript annotation and expression profiling with complementary next-generation sequencing technologies
Next-generation sequencing is excellently suited to evaluate the abundance of mRNAs to study gene expression. Here we compare two alternative technologies, cap analysis of gene expression (CAGE) and serial analysis of gene expression (SAGE), for the same RNA samples. Along with quantifying gene expression levels, CAGE can be used to identify tissue-specific transcription start sites, while SAGE...
Journal: Bioinformatics
Volume: 20, Suppl 1
Publication date: 2004